Natu Lauchande June 11, 2019
Financial services are a cornerstone of the modern opportunity value chain for individuals and businesses. Santander is a bank that strives to know its customers better in order to serve them correctly. Part of providing customers with the right financial choices is being able to know and predict their desires [1].
The topic of this project is to predict whether bank customers will make a specific transaction in the future based on a set of anonymous features. Predicting customer propensity, suitability or willingness for a transaction is paramount to bringing inclusion and low-cost financial services to underprivileged communities [6].
Predicting, for instance, whether a customer will make a specific transaction might help the financial institution provision resources where it makes business sense. An especially important application of predicting transactions is in the context of financial fraud: having a performant transaction fraud detection system can lower financial risk for institutions [7].
Being able to predict which transaction a banking user is more likely to make can also enable more ways to connect with customers: appropriate digital channels, for example SMS and mobile apps, enable engagement with customers who have previously been unreachable by, and therefore invisible to, more traditional channels.
The fact that the current project focuses on anonymized data (no identification of the variable names and basically numerical data) opens the possibility of using the outcomes of this project in diverse contexts aligned with data privacy.
The problem statement of this project is based on the Kaggle competition:
The goal of this project is to predict whether a customer will make a given unidentified transaction, given a set of features and historical data. The core of the project is to identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted [1].
The solution for this project will consist of the following:
A set of metrics will be used to optimize and choose between the different models. In terms of a metric to objectively compare them, we decided to choose the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC): a measure of how well a binary classifier distinguishes between the two classes at play in the specific problem.
[7]
Important definitions:
$$ \text{True Positive Rate (TPR)} = {\text{Number of True Positives} \over \text{Number of True Positives} + \text{Number of False Negatives}}.$$
$$ \text{False Positive Rate (FPR)} = {\text{Number of False Positives} \over \text{Number of False Positives} + \text{Number of True Negatives}}.$$
The area under the curve captures the quality of a scoring function. The best possible ROC curve has an area of 1; the closer the ROC curve area is to 1, the better the classifier. A maximum-uncertainty classifier would yield an area of 0.5, corresponding to the diagonal line of the graph in the figure above [3].
Given that the current problem is a binary classification problem, the AUC ROC metric will allow us to correctly compare the quality of the different classifiers.
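As a minimal illustration of the metric (with made-up labels and scores, not the competition data), the AUC ROC can be computed with sklearn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels and classifier scores for class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# A perfect ranking of positives above negatives gives 1.0; random guessing gives 0.5
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: one of the four positive/negative pairs is mis-ranked
```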
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
from google.colab import files
files.upload()
!cp kaggle.json ~/.kaggle/
!kaggle competitions download -c santander-customer-transaction-prediction
!unzip train.csv.zip
!pip install catboost
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix, precision_recall_fscore_support
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
import catboost as catboost
from catboost import CatBoostClassifier
# Import the modules
import pandas as pd
import numpy as np
import sklearn as sk
# Data Vis
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set(style='white', context='notebook', palette='deep')
import matplotlib.style as style
style.use('fivethirtyeight')
# Get the data
train = pd.read_csv('train.csv')
# Change the settings so that you can see all columns of the dataframe when calling df.head()
pd.set_option('display.max_columns',20)
# Get target
target = 'target'
# Get quantitative features and delete the unnecessary features
features = [f for f in train.columns if train.dtypes[f] != 'object']
features.remove('target')
train.head()
train.info()
# Capture the necessary data
variables = train.columns
count = []
for variable in variables:
    length = train[variable].count()
    count.append(length)
count_pct = np.round(100 * pd.Series(count) / len(train), 2)
count = pd.Series(count)
missing = pd.DataFrame()
missing['variables'] = variables
missing['count'] = len(train) - count
missing['count_pct'] = 100 - count_pct
missing = missing[missing['count_pct'] > 0]
missing.sort_values(by=['count_pct'], inplace=True)
missing_train = np.array(missing['variables'])
#Plot number of available data per variable
plt.subplots(figsize=(15,6))
# Plots missing data in percentage
plt.subplot(1,2,1)
plt.barh(missing['variables'], missing['count_pct'])
plt.title('Count of missing training data in percent', fontsize=15)
# Plots total row number of missing data
plt.subplot(1,2,2)
plt.barh(missing['variables'], missing['count'])
plt.title('Count of missing training data as total records', fontsize=15)
plt.show()
The graph above displays basically no missing data in the training dataset, which is good news in terms of less work during data preparation to impute missing variables.
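A more concise equivalent check, sketched here on a small stand-in DataFrame (the real `train` frame is only available after the download step), is to sum null values per column:

```python
import numpy as np
import pandas as pd

# Stand-in for the training DataFrame, with one missing value
df = pd.DataFrame({'var_0': [1.0, 2.0, 3.0, 4.0],
                   'var_1': [0.5, np.nan, 1.5, 2.5]})

# Missing-value count and percentage per column
missing_count = df.isnull().sum()
missing_pct = 100 * missing_count / len(df)
print(missing_count['var_1'])  # 1
print(missing_pct['var_1'])    # 25.0
```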
Target description
This section describes the target variable and its distribution. From the data above we clearly have a class imbalance problem here: a ratio of roughly 1:9 between customers making a transaction and not making one.
# distribution of targets
colors = ['darkseagreen','lightcoral']
plt.figure(figsize=(6,6))
plt.pie(train["target"].value_counts(), explode=(0, 0.25), labels= ["0", "1"], startangle=45, autopct='%1.1f%%', colors=colors)
plt.axis('equal')
plt.show()
It can be seen from the table description below that some of the standard deviations are very high. This dataset can definitely benefit from some normalization down the line.
train.describe()
The aim of this section is to give a visual glance at the dataset we are handling in this project, focusing on a visual tour of the features and their interrelations.
Correlation analysis helps us understand how two variables are related. It is clear from the figure below that there is very little correlation between the variables in the dataset.
# correlation with target
# Code adapted from : https://www.kaggle.com/yuzusan/santander-draft-v3-eda-lgb-nn
from scipy.stats import spearmanr
labels = []
values = []
for col in train.columns:
    if col not in ['ID_code', 'target']:
        labels.append(col)
        values.append(spearmanr(train[col].values, train['target'].values)[0])
corr_df = pd.DataFrame({'col_labels': labels, 'corr_values' : values})
corr_df = corr_df.sort_values(by='corr_values')
corr_df = corr_df[(corr_df['corr_values']>0.03) | (corr_df['corr_values']<-0.03)]
# check covariance among importance variables
cols_to_use = corr_df[(corr_df['corr_values']>0.05) | (corr_df['corr_values']<-0.05)].col_labels.tolist()
temp_df = train[cols_to_use]
corrmat = temp_df.corr(method='spearman')
f, ax = plt.subplots(figsize=(10, 10))
#Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True, cmap="Blues")
plt.title("Important variables correlation map", fontsize=15)
plt.show()
The correlation map above shows that the features are definitely not correlated, so this particular problem won't benefit much from a dimensionality reduction pre-processing step.
In the benchmark section a simple logistic regression was run as a benchmark and a feature importance list derived from it. A limited set of relevant features allows us to have a look at different properties of the dataset.
top20_features = ["var_45","var_47","var_96","var_182","var_120","var_61","var_158","var_136","var_117","var_10","var_41","var_103","var_98","var_160",
"var_17","var_183","var_38","var_30"]
top20_features_target = top20_features+["target"]
train[top20_features_target].hist(figsize=(16, 20), xlabelsize=8, ylabelsize=8)
From the distribution graphs above, an underlying normal distribution of the important variables becomes very clear. The target imbalance is also very distinctly visible alongside the top most important features.
Another important visualisation is the relation between the restricted set of relevant variables and the target pairwise :
sns.pairplot(train[top20_features_target].sample(frac=0.2), vars=top20_features, hue="target")
No significant conclusion can be drawn from the visualisations above, other than confirming the low correlation between the variables and showing the distribution against the binary target variable.
As noticed before, the training dataset suffers from class imbalance, which needs to be addressed during model iteration and mitigated using the most appropriate strategies (for example: SMOTE, sampling, penalties, etc.) [8].
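As a sketch of the penalty-based mitigation option (on synthetic noise data with a roughly 1:9 ratio, not the competition data), sklearn's `class_weight='balanced'` re-weights errors on the minority class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Synthetic dataset mimicking the roughly 1:9 target ratio
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.1).astype(int)

# Without mitigation the classifier can get away with predicting the majority class
plain = LogisticRegression(solver='lbfgs').fit(X, y)
# class_weight='balanced' penalizes minority-class errors more heavily
balanced = LogisticRegression(class_weight='balanced', solver='lbfgs').fit(X, y)

print(plain.predict(X).mean())     # near 0: minority class mostly ignored
print(balanced.predict(X).mean())  # much higher: minority class gets predicted too
```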
The general approach on solving the problem of this project is the following :
1. Apply standard data pre-processing:
* Data standardization
* Data normalization/scaling
* Category imbalance mitigation
* Feature selection based on importance ranking
2. Run a baseline classifier pipeline at each stage of the data processing to detect improvements. Having a pipeline of classifiers ready to use makes the process easy to execute and makes improvements from the different standard techniques easy to spot.
3. Select the most promising classifier from step 2 and execute further tuning.
The classifier pipeline consists of the following algorithms :
Logistic regression is one of the most classical machine learning algorithms. It is a regression technique that, instead of using a linear function to fit the training points, uses a sigmoid function. [3]
Among the main strengths of logistic regression are its simplicity, training speed, and the interpretability of its coefficients.
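The sigmoid mentioned above can be sketched in a few lines; the decision boundary sits where the linear score is zero:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function mapping any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# At z = 0 the predicted probability is exactly 0.5 (the decision boundary)
print(sigmoid(0.0))   # 0.5
print(sigmoid(5.0))   # close to 1
print(sigmoid(-5.0))  # close to 0
```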
Naive Bayes is a simple probabilistic classifier. In its simplest form it is an application of Bayes' theorem over the feature set, with the "naive" assumption that the features are conditionally independent. It is very common as a baseline for text classification problems. [10]
The main motivations behind this choice for this particular problem are the fact that it is extremely fast and the low correlation in the feature set detected during data exploration.
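A minimal GaussianNB sketch on made-up one-dimensional data (not the competition data) illustrates the speed and simplicity of the fit/predict cycle:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic example: two well-separated Gaussian clusters
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fitting only estimates per-class means and variances, hence the speed
clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[-1.2], [1.2]]))  # [0 1]
```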
Ensemble techniques are based on a combination of multiple weak learners and can be used for classification or regression. [11]
A common type of ensemble is based on tree methods (for example: Gradient Boosting, Random Forests), combining trees each of which performs well on one part of the target distribution.
Ensemble techniques are a very popular approach to many tabular/transactional data problems, with significant success across industries.
The random forest algorithm takes the mode or mean of the predictions of multiple decision trees, each learned from a random selection of the training data.
$\hat{f} = \frac{1}{B} \sum_{b=1}^Bf_b (x')$ [12]
The equation above represents the averaging of the predictions $f_b(x')$ of the $B$ individual trees.
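The averaging in the equation above can be sketched directly with hypothetical per-tree predictions:

```python
import numpy as np

# Hypothetical probability predictions f_b(x') from B = 4 individual trees
tree_predictions = np.array([0.2, 0.4, 0.3, 0.5])

# The forest prediction is the average over the B trees
B = len(tree_predictions)
f_hat = tree_predictions.sum() / B
print(f_hat)  # 0.35
```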
XGBoost is a technique based on Gradient Boosting that combines weak classifiers in sequence, where each new classifier is trained with extra weight given to the examples mispredicted so far. The implementation is relatively fast and simple to use.
CatBoost is a new generation of boosting framework that uses some new insights and comes with tuning and optimization out of the box. [13]
The main reason to choose this technique as one of the candidates for this project was to also evaluate the capabilities of this new framework.
Neural networks are a classical technique recently popularized by the industry success of Deep Learning techniques for perception data (audio, video, speech) and natural language problems.
A neural network is defined as a collection of connected units or nodes called artificial neurons; each connection can transmit a signal from one artificial neuron to another. Each neuron receives a signal from an external source (another neuron or input data) and learns an expected classification in a supervised learning setting. An artificial neuron mimics the working of a biophysical neuron, with inputs and outputs. [3]
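A single artificial neuron of the kind described can be sketched as a weighted sum of its inputs plus a bias, passed through an activation (ReLU here, matching the hidden layer used later; the weights and inputs are made up):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then ReLU."""
    z = np.dot(inputs, weights) + bias
    return max(0.0, z)  # ReLU activation

# Hypothetical inputs and weights: 1*0.5 + 2*(-0.25) + 0.1 = 0.1
print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))  # 0.1
```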
The two techniques described below (data scaling and handling imbalanced datasets) are data processing techniques executed to improve the performance of machine learning algorithms on the training data, and will be applied after a given baseline.
Machine learning algorithms and techniques generally assume out of the box that the input data is standard normally distributed. Scaling the data facilitates optimization and improves learning. Sklearn provides a simple method that will be used in this project: StandardScaler [15]
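A minimal StandardScaler sketch on a toy column shows the zero-mean, unit-variance result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# fit_transform subtracts the column mean and divides by its standard deviation
X_scaled = StandardScaler().fit_transform(X)
print(round(float(X_scaled.mean()), 6))  # 0.0
print(round(float(X_scaled.std()), 6))   # 1.0
```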
Given the fact that our initial dataset is highly unbalanced, some techniques must be used to improve the performance of the classifiers.
Most of the inspiration for the techniques used in this project came from the following project: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
The benchmark for this classification is a simple out-of-the-box Logistic Regression binary classifier over the data, implemented with the intention of improving the model with insights gained over the execution of the project.
Another possible benchmark could have been the Kaggle competition leaderboard itself; after a careful evaluation of some of the submissions, it became clear to me that most of the solutions on Kaggle are based on deep expertise and specialized competition-specific knowledge (stacking, overfitting and feature engineering). In line with broadening personal knowledge in Data Science and Machine Learning, there is a deliberate choice in this project for a simpler benchmark that will allow improvement on knowledge gained over the Nanodegree.
The benchmark model will be the out-of-the-box vanilla logistic regression outlined below:
from sklearn.linear_model import LogisticRegression
X=train[features]
y=train[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.head()
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_probas = clf.predict_proba(X_test)
A standard 70/30 ratio was chosen to split between training and testing data. This split ratio is used throughout the project.
To calculate the benchmark metrics, the sklearn metrics package was used:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)
The benchmark we are looking to beat is therefore an AUC better than 0.6089.
One extra advantage of the logistic regression approach is the ability to give a feature importance score, which helps limit the number of variables used during analytics or prediction. The features below will be used for further exploratory data analysis.
feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[0:20]
pos = np.arange(sorted_idx.shape[0]) + .5
featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')
plt.show()
This section contains a more detailed description and execution documentation of the approach to tackle this particular problem. In summary, the approach is data pre-processing tuning followed by algorithmic refinement, depicted by the training classifiers pipeline and the further refinement processes.
The main preprocessing techniques used in this project were data normalization and target data imbalance resolution through augmentation.
As mentioned before, a 7:3 ratio was used to split between training and testing data.
X=train[features]
y=train[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
For data normalization, the sklearn StandardScaler was used to run a transform over the pre-selected data. As previously mentioned, scaling was used precisely because some of the initial standard deviations were high for some of the columns (>3 in some cases).
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X))
New train and test datasets were created with the scaled data, to be used during refinement and the decision around the best classifier.
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
With regard to the training dataset imbalance, further analysis was executed over this particular problem to detect the minority class. The SMOTE package was used to handle the class-imbalanced data.
print("Before OverSampling, counts of label '1': {}".format(sum(y_train_scaled==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train_scaled==0)))
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train_scaled, y_train_scaled.ravel())
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))
Since there was a need to run multiple classifiers, I decided to investigate flexible approaches to run them in an easy and manageable manner. A promising approach, after analysing a couple, was the suggestion available in this Kaggle kernel: https://www.kaggle.com/jeffd23/10-classifier-showdown-in-scikit-learn .
The following decisions were made in order to create a scalable and streamlined approach for experiments:
def run_tabular_prediction_pipeline(X_train, X_test, y_train, y_test):
    nn_batch_num = int((X_train.shape)[0]/100)  # Magic guess number
    input_dim_num = X_train.shape[1]

    # Define closure for NN callback
    def baseline_model_nn():
        # Create model
        model = Sequential()
        model.add(Dense(8, input_dim=input_dim_num, activation='relu'))
        model.add(Dense(2, activation='softmax'))
        # Compile model
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    classifiers = [
        RandomForestClassifier(),
        LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial'),
        GaussianNB(),
        XGBClassifier(objective='binary:logistic', tree_method='gpu_hist'),
        CatBoostClassifier(task_type="GPU", verbose=False),
        KerasClassifier(build_fn=baseline_model_nn, epochs=100, batch_size=nn_batch_num, verbose=0)
    ]

    # Logging for visual comparison
    log_cols = ['Classifier', "AUC ROC"]
    log = pd.DataFrame(columns=log_cols)

    for clf in classifiers:
        try:
            clf.fit(X_train, y_train)
            train_predictions = clf.predict(X_test)
            roc_auc = roc_auc_score(y_test, train_predictions)
        except:
            # To handle failure situations
            roc_auc = None
        name = clf.__class__.__name__
        print("="*30)
        print(name)
        print('****Results****')
        print("AUC ROC :{}".format(roc_auc))
        log_entry = pd.DataFrame([[name, roc_auc]], columns=log_cols)
        log = log.append(log_entry)
    return log
The function run_tabular_prediction_pipeline contains the list of classifiers outlined in the algorithms section and a standard way to obtain the AUC ROC metric (the chosen metric for the project). It allows us to run the pipeline during each step of the refinement process to understand the improvement of each technique over the pipeline of classifiers.
This experiment involves running the classifier pipeline over all the data and collecting the required metric . Further improvements to the solution will be documented in the Refinement section.
run_tabular_prediction_pipeline(X_train, X_test, y_train, y_test)
In the refinement section we improve over Experiment A by trying some of the pre-processing techniques until we reach a point where we are happy with one or more of our candidate solutions.
In this section we run the pipeline over the normalized data, with the training datasets produced by sklearn's StandardScaler.
run_tabular_prediction_pipeline(X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled )
In this section we run the classifier pipeline with the class-imbalance mitigation strategy of oversampling the minority class.
run_tabular_prediction_pipeline(X_train_res, X_test_scaled, y_train_res, y_test_scaled )
The XGB classifier had an issue handling the augmented data format; special provisioning would have been needed to fix this problem, so it was decided to remove it from the Step 2 iteration. The CatBoost classifier is by itself a representative of the Gradient Boosting family of algorithms.
The most impressive candidate from a metrics perspective is Logistic Regression, with significant AUC gains over the baseline and ahead of the runner-up, the KerasClassifier based on a simple neural network. Since data preprocessing was heavily explored in the previous refinement steps, at this stage we turn to hyperparameter tuning of the Logistic Regression solution.
The parameter tuning consisted of exploring the solution space of the following elements:
* C - the regularization strength parameter
* solver - the method used to solve the regression optimization
grid = { 'C': np.power(10.0, np.arange(-10, 10)) ,
'solver': ['newton-cg','lbfgs'],
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=100, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=10)
gs.fit(X_train_res, y_train_res)
Following the training process we execute the best classifier over the test data in order to capture the needed metrics for this project. The listing above presents the different tested parameters.
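The winning parameter combination can be read off the fitted search object via `best_params_` and `best_score_`; sketched here on a small synthetic dataset so the snippet is self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
# Synthetic binary problem with a real signal in the first feature
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

grid = {'C': [0.1, 1.0, 10.0], 'solver': ['newton-cg', 'lbfgs']}
gs = GridSearchCV(LogisticRegression(penalty='l2'), grid, scoring='roc_auc', cv=3)
gs.fit(X, y)

# best_params_ holds the winning combination, best_score_ its cross-validated AUC
print(gs.best_params_)
print(round(gs.best_score_, 3))
```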
y_pred_grid = gs.predict(X_test_scaled)
print('****Results****')
roc_auc = roc_auc_score(y_test, y_pred_grid)
print("AUC ROC: {} ".format(roc_auc))
The hyperparameter tuning process yielded a tiny improvement over Step 2, but nonetheless the cross-validation process brings to the table a more robust classifier that generalizes better.
The results section of this project is intertwined with the previous sections; the main results of this project are the following:
In summary, we were able to improve on the benchmark in terms of the chosen metric by more than 20%, which is by itself a good result.
Model evaluation and validation happened in Step 3 of refinement, using GridSearch with a 10-fold cross-validation approach.
The final model will definitely be the last Logistic Regression approach from Step 3, given that it is a much simpler and widely understood algorithm and can provide an easy and interpretable model for future users of the solution depicted in this notebook. The neural network could have been tweaked a bit more, but the Logistic Regression approach was favoured given its simplicity.
This was definitely a very rewarding project from a knowledge acquisition perspective. I had the opportunity to familiarize myself a bit more with the very rewarding world of Data Science competitions and to learn from the community and the available knowledge, certainly a source of information that I will consider in my future projects.
Comparing the performance of the different models was central to making a decision around the best model for this project.
It becomes clear from the visualisation below that Logistic Regression and the KerasClassifier, from an AUC ROC perspective, are the ones that gain the most from the processing techniques used in this project.
import itertools
steps = [1, 2, 3] * 5  # refinement steps 1-3 for each of the five classifiers
classifiers = ["RandomForestClassifier", "LogisticRegression","GaussianNB","CatBoostClassifier","KerasClassifier"]
classifiers = list(itertools.chain.from_iterable(itertools.repeat(x, 3) for x in classifiers))
values = [ 0.506549 ,0.507121,0.529920, #RandomForestClassifier
0.608911 ,0.628388,0.774191, #LogisticRegression
0.670789 ,0.670871,0.525620, #GaussianNB
0.645972,0.646054,0.658729, #CatBoostClassifier
0.606637,0.634216,0.749112 #KerasClassifier
]
classifier_vis_df = pd.DataFrame({'step':steps, 'classifier':classifiers, 'value':values})
# reshape the data to get values by time for each label
classifier_vis_df = classifier_vis_df.pivot(index='step', columns='classifier', values='value')
classifier_vis_df.plot()
Main points of reflection:
This was a major independent undertaking at a personal level in the Data Science world, a very distinct experience from the rest of the projects of the course, where guidance on the problem description was readily available.
An important takeaway of this project is that simpler methods can, almost out of the box, outperform more advanced and modern techniques, for example boosting and deep learning.
I found this project particularly challenging and rewarding at the same time, given that I participated for the first time in a Kaggle competition.
During the execution of the project I noticed that I had to make important tradeoffs on some of the initial ideas in the proposal, for example: the focus on software tools, deeper investigation of deep learning methods, and a couple of other approaches that didn't prove to be very relevant to the conclusion of the project.
One significant learning for me, for a couple of the algorithms, was the ability to use the GPU computing environment provided by Google Colab.
Possible improvements to this project are the following: